Abstract: The prevalence of hidden Markov models (HMMs)
in various applications of statistical signal processing and commu- 
nications is a testament to the power and flexibility of the model. 
In this paper, we link the identifiability problem with tensor 
decomposition, in particular, the Canonical Polyadic decomposi- 
tion. Using recent results in deriving uniqueness conditions for 
tensor decomposition, we are able to provide a necessary and 
sufficient condition for the identification of the parameters of 
discrete time finite alphabet HMMs. This result resolves a long-standing
open problem regarding the derivation of a necessary
and sufficient condition for uniquely identifying an HMM. We 
then further extend recent preliminary work on the identification 
of HMMs with multiple observers by deriving necessary and 
sufficient conditions for identifiability in this setting. 

I. Introduction 

The hidden Markov model (HMM) was first introduced in 
the late 1950s by Blackwell and Koopmans f^l and generalised 
later by Baum and Petrie [2]. HMMs have been applied to
a variety of domains, such as signal processing, machine 
learning, communications and many more, with particular 
emphasis on the inference of the parameters of the HMM, 
in particular, the hidden states of the system. Typically, an 
unbiased, or asymptotically unbiased, estimator such as a 
maximum likelihood estimator, is used to infer these states, 
using algorithms such as the famed Baum-Welch algorithm 
O. However, identifiability conditions, required to ensure the 
existence of an unbiased estimator, are generally not well- 
known, with only a select number of works proposing these 
conditions 12, ||6l, Q- These conditions are probabilistic and 
difficult to verify in practice. 

In this paper, we derive an identifiability condition for 
a stationary discrete time HMM, where the observations 
are the realisations of a probabilistic function. We show 
a strong connection between the identifiability of HMMs 
and the uniqueness of the Canonical Polyadic (CP) tensor 
decomposition (see [9]). Specifically, through a tensor model
called the restricted CP model, we derive a necessary and
sufficient condition by using a result by Kruskal [10], called
the permutation lemma. Our main result resolves an open 
problem regarding the derivation of a necessary and sufficient 
condition for uniquely identifying an HMM. A highlight of
our result is that the condition is deterministic, in contrast
to generic (probabilistic) identifiability results. It is also
easier to verify than previous conditions.

These results are particularly helpful in studying the recently
proposed multi-observer HMMs [11], [12]. We consider
two settings: the homogeneous setting, where all observers
possess the same observation matrix, and the heterogeneous
setting, where at least two observers have distinct observation 
matrices. Surprisingly, the condition for identifiability in the 
homogeneous setting is equivalent to having just a single 
observer of the HMM. Thus, if the HMM cannot be identified 
with a single observer, no additional number of independent 
homogeneous observers can hope to identify the hidden states. 
The heterogeneous setting is shown to provide a significant 
advantage over the homogeneous setting, due to sufficient vari- 
ability of the observations, contributed by different viewpoints 
of the independent observers. 

The rest of this section introduces the notation used throughout
the paper. Section II formulates our problem and defines
the HMM. Section III provides an overview of the CP decomposition
and some important results in tensor decomposition
that we will invoke when proving our results. Section IV
derives the identifiability condition of HMMs with only one
observer present. Section V extends our framework to
the multi-observer setting. Finally, we conclude and outline
some future work in Section VI. We defer proofs and more
details to our technical report [16].

Some notation and definitions are in order. All vectors and
matrices are represented with lower and upper case boldface
fonts respectively. Sets are represented with calligraphic font.
Random variables are represented with upper case italic fonts, while their
realisations are represented by lower case italic fonts. Tensors
are represented by upper case, calligraphic boldface fonts. Let
$\mathbf{I}_n$ be the identity matrix of size $n \times n$, while $\mathbf{1}_n$ denotes
a column vector of $n$ ones. $\mathrm{supp}(\mathbf{x})$ denotes the support of
vector $\mathbf{x}$.

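The Kronecker, or tensor, product between two matrices
$\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{p \times q}$ is denoted by $\mathbf{A} \otimes \mathbf{B} \in \mathbb{R}^{mp \times nq}$.
We also define a row-wise tensor product, which we borrow
from [1], where with matrices $\mathbf{A} \in \mathbb{R}^{m \times n_1}$ and $\mathbf{B} \in \mathbb{R}^{m \times n_2}$
with rows $\mathbf{a}_1, \mathbf{a}_2, \cdots, \mathbf{a}_m$ and $\mathbf{b}_1, \mathbf{b}_2, \cdots, \mathbf{b}_m$ respectively,
the row-wise tensor product is equivalent to

$$\mathbf{A} \otimes^{\mathrm{row}} \mathbf{B} = \begin{bmatrix} \mathbf{a}_1 \otimes \mathbf{b}_1 \\ \mathbf{a}_2 \otimes \mathbf{b}_2 \\ \vdots \\ \mathbf{a}_m \otimes \mathbf{b}_m \end{bmatrix}.$$

This definition is related to the Khatri-Rao product, which is
the column-wise tensor product. Note that for row vectors $\mathbf{a}$
and $\mathbf{b}$, $\mathbf{a} \otimes^{\mathrm{row}} \mathbf{b} = \mathbf{a} \otimes \mathbf{b}$. All other notation will be defined
on an as-needed basis.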
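As an illustration of the row-wise tensor product, the following is a minimal NumPy sketch. The function name `row_tensor` and the example matrices are assumptions made for this illustration and do not come from the paper.

```python
import numpy as np

def row_tensor(A, B):
    """Row-wise tensor product: row i of the result is kron(A[i], B[i])."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    if A.shape[0] != B.shape[0]:
        raise ValueError("both matrices must have the same number of rows")
    # einsum forms every product A[i, j] * B[i, k]; reshape flattens (j, k) per row
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

# Hypothetical example matrices with the same number of rows.
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0]])
R = row_tensor(A, B)  # shape (2, 6)
```

For a single pair of row vectors the sketch reduces to the ordinary Kronecker product, in line with the remark above.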



II. Hidden Markov Models 



Throughout the paper, we only consider the discrete time
finite alphabet HMM. Time is organised into regularly spaced
discrete intervals. Let $\{X_t\}_{t \ge 1}$ be the non-observable states
of an irreducible, aperiodic Markov chain and $\{Y_t\}_{t \ge 1}$ be
the observable states, both at time $t$, called an observation
process. Thus, $\{X_t\}_{t \ge 1}$ constitutes the hidden Markov chain,
as it cannot be directly measured. Without loss of generality,
let the alphabets of $X_t$ and $Y_t$ be the sets $\mathcal{X} = \{1, 2, \cdots, q\}$
and $\mathcal{Y} = \{1, 2, \cdots, \kappa\}$ respectively. We further assume $q \ge 2$
and $\kappa \ge 2$.

We assume $\{X_t\}_{t \ge 1}$ is stationary for simplicity of exposition,
although our results apply to non-stationary Markov
chains as well, with appropriate modifications. The observations
$\{Y_t\}_{t \ge 1}$ are assumed to be i.i.d. with $Y_t$ only dependent
on $X_t$. The HMM is described by the joint process
$(X_t, Y_t) \in \mathcal{X} \times \mathcal{Y}$ for all $t = 1, 2, \cdots, N$, in terms of the
state space model

$$X_{t+1} = f(X_t), \quad Y_t = g(X_t),$$

with the initial state described by an initial random variable
$X_1$, from an initial distribution $\Pr(X_1 = i) = \pi_i$, denoted by
the $q$-length row vector $\boldsymbol{\pi}$. The function $f(\cdot)$ is probabilistic,
and obeys the $q \times q$ transition matrix $\mathbf{A}$, where its $(i,j)$-th
element is $a_{ij} := \Pr(X_{t+1} = j \mid X_t = i)$. The function $g(\cdot)$
may be deterministic, but here, we consider it a probabilistic
function, with the transition of observation states described by
the $q \times \kappa$ observation matrix $\mathbf{B}$, where the $(i,j)$-th element is
$b_{ij} := \Pr(Y_t = j \mid X_t = i)$. The function $g(\cdot)$ is assumed to be
surjective. Since the Markov chain is assumed to be stationary,
the observation process $\{Y_t\}_{t \ge 1}$ is stationary as well. Also, the
observation process $\{Y_t\}_{t \ge 1}$ may not be a Markov chain in
general, even though the input process is. We are now in a
position to formalise HMMs.

Definition 1: A discrete time finite alphabet HMM is parameterised
by the set $\lambda = \{\boldsymbol{\pi}; q, \kappa, \mathbf{A}, \mathbf{B}\}$:

• $\boldsymbol{\pi}$: initial state probabilities, which may or may not be
sampled from a stationary distribution,

• $q$: number of hidden states of $X_t$, i.e. $|\mathcal{X}| = q$,

• $\kappa$: number of observation states of $Y_t$, i.e. $|\mathcal{Y}| = \kappa$,

• $\mathbf{A}$: $q \times q$ transition matrix of hidden states $X_t$, and

• $\mathbf{B}$: $q \times \kappa$ observation matrix of observation process $Y_t$.

One way of measuring the complexity of the HMM is
by the number of states required to describe the Markov
chain. The order of an HMM is the minimum of $|\mathcal{X}|$ amongst
all representations [6]. An HMM is minimal if it has a
representation such that $|\mathcal{X}|$ is equal to its order.

An observation letter $y_t$ is defined as a single realisation
of the observation process at time $t$. A sequence from time $t_1$
to $t_2$ is defined as a series of consecutive observation letters
from time $t_1$ to $t_2$. A sequence has length $N$ if it consists of
$N$ observation letters.

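The joint probability of a particular observed sequence
$y_1, y_2, \cdots, y_N$ may be described by

$$P_\lambda(Y_1 = y_1, Y_2 = y_2, \cdots, Y_N = y_N) = \boldsymbol{\pi} \mathbf{W} \mathbf{E}(y_1) \mathbf{W} \mathbf{E}(y_2) \cdots \mathbf{W} \mathbf{E}(y_N) \mathbf{1}_q \tag{1}$$

where $\mathbf{E}(k)$ is a $\kappa q \times q$ matrix with the $q \times q$ identity matrix
in the $k$-th row partition and zeros elsewhere, and

$$\mathbf{W} = \mathbf{B} \otimes^{\mathrm{row}} \mathbf{A} = \begin{bmatrix} \mathbf{D}_1(\mathbf{B})\mathbf{A} & \mathbf{D}_2(\mathbf{B})\mathbf{A} & \cdots & \mathbf{D}_\kappa(\mathbf{B})\mathbf{A} \end{bmatrix},$$

where $\mathbf{D}_k(\mathbf{B})$ denotes the diagonal matrix with the $k$-th
column of $\mathbf{B}$ lying on its diagonal.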
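The matrix form of the joint probability can be evaluated directly. The sketch below is only illustrative: the helper names (`row_tensor`, `selector`, `joint_probability`), the 0-based letter indexing, and the two-state example HMM are assumptions made here, not part of the paper.

```python
import numpy as np

def row_tensor(A, B):
    """Row-wise tensor product: row i is kron(A[i], B[i])."""
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

def selector(k, q, kappa):
    """E(k): (kappa*q) x q matrix with I_q in the k-th row partition (k is 0-based here)."""
    E = np.zeros((kappa * q, q))
    E[k * q:(k + 1) * q, :] = np.eye(q)
    return E

def joint_probability(pi, A, B, ys):
    """Evaluate pi W E(y_1) W E(y_2) ... W E(y_N) 1_q for 0-based letters ys."""
    q, kappa = B.shape
    W = row_tensor(B, A)          # q x (kappa*q), i.e. [D_1(B)A ... D_kappa(B)A]
    v = np.asarray(pi, dtype=float)
    for y in ys:
        v = v @ W @ selector(y, q, kappa)
    return float(v.sum())         # multiplying by 1_q sums the entries

# Illustrative two-state, two-letter HMM (values chosen for this sketch only).
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([2 / 3, 1 / 3])     # stationary distribution of A
p = joint_probability(pi, A, B, [0, 1, 1])
```

A quick sanity check on such a sketch is that the probabilities of all sequences of a fixed length sum to one.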

A. Equivalence and identifiability of HMMs 

The observation process $\{Y_t\}_{t \ge 1}$ is assumed to admit a
representation of a Markov chain with $q$ states. It is possible
to construct an HMM with more states that generates the
same observation process. Let the process be alternatively
parameterised by the set $\tilde{\lambda} = \{\tilde{\boldsymbol{\pi}}; \tilde{q}, \kappa, \tilde{\mathbf{A}}, \tilde{\mathbf{B}}\}$. Equivalence
of HMMs is defined as follows:

Definition 2: Two HMMs with parameterisations $\lambda$ and $\tilde{\lambda}$
respectively are equivalent if and only if for all sequences
$y_1, y_2, \cdots, y_N$,

$$P_\lambda(Y_1 = y_1, Y_2 = y_2, \cdots, Y_N = y_N) = P_{\tilde{\lambda}}(Y_1 = y_1, Y_2 = y_2, \cdots, Y_N = y_N)$$

for any integer $N \ge 1$.

There are two types of identifiability: deterministic and 
generic identifiability. Deterministic identifiability implies that 
an HMM satisfying the condition can always be identified. 
Generic identifiability means that the HMM is identified 
with probability 1, i.e. identifiability holds everywhere except 
for some model parameters that lie in a set of Lebesgue 
measure zero. Allman et al. [1] define it as all nonidentifiable
parameters of the model lying in a proper subvariety. 

Our main result is a condition under which an HMM can or cannot
be deterministically identified. 

III. A Summary on Tensors 

As shown in (1), the joint probability of a sequence can
be expressed as matrix multiplication of row tensor products.
Identifiability then simply boils down to decomposing the
product into factors via tensor decomposition.

The tensor is essentially a multidimensional array of numbers,
with a general overview found in [9]. The tensor order
is the number of indices required to unambiguously label a
component of the tensor; each such index is called a way.

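One particularly important decomposition of a tensor is the
Canonical Polyadic (CP) decomposition. Let the components
of a tensor be the matrices $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$, each with $q$ rows, written
succinctly as $[\mathbf{A}, \mathbf{B}, \mathbf{C}]$. Then a tensor $\mathcal{X}$ constructed from
these components is expressed as

$$[\mathcal{X}; \mathbf{A}, \mathbf{B}, \mathbf{C}] = \sum_{i=1}^{q} \mathbf{a}_i \otimes \mathbf{b}_i \otimes \mathbf{c}_i, \tag{2}$$

where $\mathbf{a}_i, \mathbf{b}_i, \mathbf{c}_i$, $i = 1, 2, \cdots, q$ are the rows of $\mathbf{A}$, $\mathbf{B}$ and
$\mathbf{C}$ respectively, essentially a decomposition into $q$ rank-one
tensors. A tensor is irreducible with $q$ components if and
only if it cannot be decomposed into fewer than $q$ components.
A tensor $\mathcal{X}$ is permutation and scaling indeterminate if its
components are unique up to a scaling and permutation of
rows. Hence, for any alternative decomposition of $\mathcal{X}$ with
components $[\tilde{\mathbf{A}}, \tilde{\mathbf{B}}, \tilde{\mathbf{C}}]$, there exists a permutation matrix $\boldsymbol{\Pi}$
and nonsingular scaling matrices $\boldsymbol{\Lambda}_a$, $\boldsymbol{\Lambda}_b$ and $\boldsymbol{\Lambda}_c$ where
$\boldsymbol{\Lambda}_a \boldsymbol{\Lambda}_b \boldsymbol{\Lambda}_c = \mathbf{I}_q$, such that $\tilde{\mathbf{A}} = \boldsymbol{\Pi}\boldsymbol{\Lambda}_a\mathbf{A}$, $\tilde{\mathbf{B}} = \boldsymbol{\Pi}\boldsymbol{\Lambda}_b\mathbf{B}$ and
$\tilde{\mathbf{C}} = \boldsymbol{\Pi}\boldsymbol{\Lambda}_c\mathbf{C}$. Finally, an equivalent representation of a tensor
is given by its mode matricisation. For example, the first mode
matricisation of $\mathcal{X}$ is $\mathbf{A}^T(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{C})$, while its second mode
matricisation is $\mathbf{B}^T(\mathbf{C} \otimes^{\mathrm{row}} \mathbf{A})$.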
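To make the CP construction and its matricisations concrete, here is a small NumPy sketch. The function names, the random component matrices and the particular unfolding order (last index varying fastest) are assumptions of this illustration.

```python
import numpy as np

def row_tensor(A, B):
    """Row-wise tensor product: row i is kron(A[i], B[i])."""
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

def cp_tensor(A, B, C):
    """Sum over i of the outer products a_i (x) b_i (x) c_i of the i-th rows."""
    return np.einsum("ia,ib,ic->abc", A, B, C)

# Hypothetical components with q = 3 rows each.
rng = np.random.default_rng(0)
q = 3
A, B, C = rng.random((q, 2)), rng.random((q, 4)), rng.random((q, 5))
X = cp_tensor(A, B, C)                              # shape (2, 4, 5)

# First mode: rows indexed by the first way, columns by (second, third) ways.
mode1 = X.reshape(A.shape[1], -1)
# Second mode: rows indexed by the second way, columns by (third, first) ways.
mode2 = np.transpose(X, (1, 2, 0)).reshape(B.shape[1], -1)
```

Under these conventions the two unfoldings coincide with the row-wise tensor product expressions given in the text.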

Surprisingly, under certain mild conditions, a tensor of
order 3 and above can, unlike a matrix, have a unique CP decomposition,
up to a scaling and permutation of rows of the components.
A general sufficient condition was proposed
by Kruskal [10], with its generalisation in [14].

Central to Kruskal's and our results is the concept of the 
Kruskal rank defined below: 

Definition 3: The Kruskal rank of a matrix $\mathbf{X}$, $\mathrm{krank}(\mathbf{X})$,
is defined as the largest integer $k$ such that every subset of $k$
rows is linearly independent.

Unlike the rank of a matrix, the Kruskal rank changes when 
one defines it for columns instead. Here, we stick to the 
above definition for rows, since this is directly relevant to our 
discussions. 
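A brute-force evaluation of the row-wise Kruskal rank follows directly from the definition. The sketch below is an illustration under stated assumptions: the function name `krank`, the numerical tolerance and the example matrices are chosen here and are not taken from the paper.

```python
import numpy as np
from itertools import combinations

def krank(M, tol=1e-10):
    """Largest k such that every set of k rows of M is linearly independent."""
    M = np.atleast_2d(np.asarray(M, dtype=float))
    n = M.shape[0]
    k = 0
    for size in range(1, n + 1):
        # Every subset of `size` rows must have full rank `size`.
        if all(np.linalg.matrix_rank(M[list(rows)], tol=tol) == size
               for rows in combinations(range(n), size)):
            k = size
        else:
            break
    return k

# Hypothetical examples.
M1 = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])            # any two rows are independent
M2 = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # rows 0 and 1 are parallel
```

In the second example the ordinary rank is 2 while the row-wise Kruskal rank is 1, and applying the same function to the transpose gives 0 because of the zero column, which illustrates the row/column asymmetry noted above.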

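The cornerstone of Kruskal's result, as pointed out by [8],
[15], is Kruskal's permutation lemma, here modified for rows.
The lemma is key to our proposed identifiability condition.

Lemma 4 (Permutation lemma): Given two matrices $\mathbf{H}$ and
$\tilde{\mathbf{H}}$, both with size $q \times r$, suppose that $\mathbf{H}$ has no identically
zero rows, and assume the following implication holds for all
column vectors $\mathbf{x}$:

$$|\mathrm{supp}(\tilde{\mathbf{H}}\mathbf{x})| \le q - \mathrm{rank}(\tilde{\mathbf{H}}) + 1$$

implies that $|\mathrm{supp}(\mathbf{H}\mathbf{x})| \le |\mathrm{supp}(\tilde{\mathbf{H}}\mathbf{x})|$.

Then, $\tilde{\mathbf{H}} = \boldsymbol{\Pi}\boldsymbol{\Lambda}\mathbf{H}$, where $\boldsymbol{\Pi}$ is a permutation matrix and $\boldsymbol{\Lambda}$
is a nonsingular diagonal scaling matrix.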

IV. Single observer HMM Setting 

Intuitively, the identifiability of an HMM rests on the
number of states $q$ and the number of observation states $\kappa$,
which are directly related to $\mathbf{A}$ and $\mathbf{B}$ respectively.
Our reformulation using the properties of tensors allows us to
explore this relationship. Since the underlying Markov chain
is assumed to be irreducible and aperiodic, and assuming
no letter in $\mathcal{Y}$ is redundant, $\mathrm{krank}(\mathbf{A}) \ge 1$ and
$\mathrm{krank}(\mathbf{B}) \ge 1$ respectively.

A. Main result 

Our main result is the following: 

Theorem 5: For an HMM parameterised by $\lambda$ to be unique
up to a scaling and permutation of states, it is necessary and
sufficient that $\mathrm{krank}(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = q$.
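The condition of Theorem 5 can be checked numerically for small models. The following is a minimal sketch, assuming a brute-force Kruskal rank; the helper names (`row_tensor`, `krank`, `satisfies_theorem5`) and both example HMMs are hypothetical and serve only as an illustration.

```python
import numpy as np
from itertools import combinations

def row_tensor(A, B):
    """Row-wise tensor product: row i is kron(A[i], B[i])."""
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

def krank(M, tol=1e-10):
    """Largest k such that every set of k rows of M is linearly independent."""
    n, k = M.shape[0], 0
    for size in range(1, n + 1):
        if all(np.linalg.matrix_rank(M[list(r)], tol=tol) == size
               for r in combinations(range(n), size)):
            k = size
        else:
            break
    return k

def satisfies_theorem5(A, B):
    """Check krank(B (row-wise x) A) == q, with q the number of hidden states."""
    return krank(row_tensor(B, A)) == A.shape[0]

# Example 1: distinct rows in both A and B.
A1 = np.array([[0.9, 0.1], [0.2, 0.8]])
B1 = np.array([[0.7, 0.3], [0.4, 0.6]])
# Example 2: both hidden states behave identically, so they cannot be told apart.
A2 = np.array([[0.5, 0.5], [0.5, 0.5]])
B2 = np.array([[0.6, 0.4], [0.6, 0.4]])

ok1 = satisfies_theorem5(A1, B1)
ok2 = satisfies_theorem5(A2, B2)
```

In the second example the two rows of the row-wise product coincide, so the Kruskal rank drops to 1 and the condition fails.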



Previous work [6], [13] studied the class of regular HMMs.
An HMM is regular if there exists a set of $2q$ sequences whose
joint probabilities can be described by a product of two linear
subspaces of dimension $q$. A regular HMM is permutation and
scaling indeterminate. Finesso [6] provided a simple sufficient
condition for equivalence between two HMMs. Additionally,
the author proved a necessary condition for a regular HMM
to be equivalent to another HMM. The proof is probabilistic,
as he showed that the set of parameters $\lambda$ of HMMs almost surely
leads to regularity with respect to the Lebesgue measure.

Our result differs from Finesso's in the sense that we
show an interaction between the hidden and the observation
states, dispensing with the assumption of regularity of the HMM
and replacing it with a deterministic condition. Furthermore,
regular HMMs are also minimal, implying $\mathrm{krank}(\mathbf{A}) = q$
(see [16]). The result shows there is no longer any restriction
to regular HMMs, or as coined by Finesso [6], a Petrie
point after Petrie's work [13] on regular HMMs, since the
deterministic condition covers all possible cases. In this sense,
the result is the strongest to date on the identifiability, whether
deterministic or generic, of HMMs.

Theorem 5 is a consequence of the properties of a specific
restricted CP tensor, where one mode of the tensor is full rank
[8], which we call the per letter tensor. Let us consider $\mathcal{L}$,
a three way tensor of dimensions $q \times q \times \kappa$, with component
matrices $[\mathbf{A}, \mathbf{I}_q, \mathbf{B}]$. For each element of $\mathcal{L}$,

$$\ell_{i,j,k} := P_\lambda(X_{t+1} = i \mid X_t = j) \cdot P_\lambda(Y_t = k \mid X_t = j) = P_\lambda(Y_t = k, X_{t+1} = i \mid X_t = j).$$

Then, each slice of the third mode $\mathcal{L}_k := \mathcal{L}_{\cdot,\cdot,k} = \mathbf{I}_q \mathbf{D}_k(\mathbf{B}) \mathbf{A}$,
$k = 1, 2, \cdots, \kappa$ is the per observation letter and state probability
of the set $\mathcal{Y}$ arranged in ascending order$^1$. A key
observation is the equivalence of $\mathcal{L}$ and $\mathbf{I}_q(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A})$, its
second mode matricisation.
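The per letter tensor and its second mode matricisation can be built explicitly for a small model. This is a sketch under assumptions: the example matrices, the `(i, j, k)` storage order and the unfolding convention are chosen for this illustration only.

```python
import numpy as np

def row_tensor(A, B):
    """Row-wise tensor product: row i is kron(A[i], B[i])."""
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

# Illustrative HMM parameters (assumed values).
A = np.array([[0.9, 0.1], [0.2, 0.8]])              # q x q transition matrix
B = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])    # q x kappa observation matrix
q, kappa = B.shape

# Element (i, j, k) is Pr(X_{t+1} = i | X_t = j) * Pr(Y_t = k | X_t = j) = A[j, i] * B[j, k].
L = np.einsum("ji,jk->ijk", A, B)

# The same tensor from the CP components [A, I_q, B].
L_cp = np.einsum("ri,rj,rk->ijk", A, np.eye(q), B)

# Second mode unfolding: rows indexed by j, columns by (k, i) with i varying fastest.
mode2 = np.transpose(L, (1, 2, 0)).reshape(q, kappa * q)
```

With the `(i, j, k)` storage order used here, the k-th slice `L[:, :, k]` holds the transpose of $\mathbf{D}_k(\mathbf{B})\mathbf{A}$, and the unfolding reproduces the row-wise product of the observation and transition matrices.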

We now prove a necessary and sufficient condition on the 
uniqueness of the decomposition of C. 

Lemma 6: The per letter tensor $\mathcal{L}$ is unique up to a permutation
and scaling of rows if and only if $\mathrm{krank}(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = q$.

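Proof: Crucial to our argument is the central claim that
it is necessary and sufficient that none of the non-trivial linear
combinations of rows of $\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}$ is expressible by a tensor
product of two row vectors, that is, $\mathrm{krank}(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = q$.
Necessity is proven by contradiction. We borrow a counterexample
from [8]. If the first two rows can be expressed
as a vector $\mathbf{b}_1 \otimes \mathbf{a}_1 + \mathbf{b}_2 \otimes \mathbf{a}_2 = \tilde{\mathbf{b}}_1 \otimes \tilde{\mathbf{a}}_1$, an alternative
decomposition of $\mathcal{L}$ is as follows:

$$\mathbf{I}_q(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = \begin{bmatrix} 1 & 0 & \mathbf{0} \\ -1 & 1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{I}_{q-2} \end{bmatrix}^T \begin{bmatrix} \tilde{\mathbf{b}}_1 \otimes \tilde{\mathbf{a}}_1 \\ \mathbf{b}_2 \otimes \mathbf{a}_2 \\ \vdots \\ \mathbf{b}_q \otimes \mathbf{a}_q \end{bmatrix} = \tilde{\mathbf{C}}^T(\tilde{\mathbf{B}} \otimes^{\mathrm{row}} \tilde{\mathbf{A}}). \tag{3}$$

As $\tilde{\mathbf{C}}$ is not a permutation or scaling matrix, there is no unique
decomposition for $\mathcal{L}$.

$^1$This is one way of labelling the letters, as labelling is non-unique.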

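For sufficiency, we only need to verify $|\mathrm{supp}(\mathbf{x})| = |\mathrm{supp}(\mathbf{I}_q\mathbf{x})| \le |\mathrm{supp}(\tilde{\mathbf{C}}\mathbf{x})|$
for all $|\mathrm{supp}(\tilde{\mathbf{C}}\mathbf{x})| = 1$ (since $\mathbf{x} = \mathbf{0}$
is the only zero support vector) for some $\tilde{\mathbf{C}}$, a component of
an alternative decomposition of $\mathcal{L}$, to satisfy Lemma 4. With
an alternative decomposition $\mathbf{I}_q(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = \tilde{\mathbf{C}}^T(\tilde{\mathbf{B}} \otimes^{\mathrm{row}} \tilde{\mathbf{A}})$,
then $\forall \mathbf{x}$,

$$\mathbf{x}^T(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = \mathbf{x}^T\tilde{\mathbf{C}}^T(\tilde{\mathbf{B}} \otimes^{\mathrm{row}} \tilde{\mathbf{A}}).$$

Consider $\mathbf{x}$ with $|\mathrm{supp}(\tilde{\mathbf{C}}\mathbf{x})| = 1$. Then, $\mathbf{x}^T\tilde{\mathbf{C}}^T(\tilde{\mathbf{B}} \otimes^{\mathrm{row}} \tilde{\mathbf{A}})$
is just a scaled tensor product of one row of $\tilde{\mathbf{A}}$ and the corresponding
row of $\tilde{\mathbf{B}}$, by the above equation. If $|\mathrm{supp}(\mathbf{x})| > 1$,
then more than one row of $\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}$ is needed to represent
a row of $\tilde{\mathbf{B}} \otimes^{\mathrm{row}} \tilde{\mathbf{A}}$. This means a row of $\tilde{\mathbf{B}} \otimes^{\mathrm{row}} \tilde{\mathbf{A}}$ is not
just a scaling and permutation, so $|\mathrm{supp}(\mathbf{x})| \le 1$ must hold.
Kruskal's permutation lemma (Lemma 4) then implies $\mathbf{I}_q$ and
$\tilde{\mathbf{C}}$ are equivalent up to a permutation and scaling of rows.
Putting these arguments together implies the components of
$\mathcal{L}$, i.e. $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{I}_q$, are all unique up to a permutation and
scaling of rows, and the result follows. ■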

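The above implies a sufficient condition. For necessity,
suppose $\mathrm{krank}(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) < q$, but the HMM is identifiable.
However, one can construct $\tilde{\boldsymbol{\pi}} = \boldsymbol{\pi}\tilde{\mathbf{C}}^T$, $\tilde{\mathbf{1}}_q = (\tilde{\mathbf{C}}^T)^{-1}\mathbf{1}_q$ and
$(\tilde{\mathbf{B}} \otimes^{\mathrm{row}} \tilde{\mathbf{A}})\tilde{\mathbf{E}}(k) = (\tilde{\mathbf{B}} \otimes^{\mathrm{row}} \tilde{\mathbf{A}})(\mathbf{E}(k)\tilde{\mathbf{C}}^T)$, $\forall k$,
with $\tilde{\mathbf{C}}$ as above in the proof of Lemma 6, resulting in a
contradiction. Thus, $\mathrm{krank}(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = q$ is a necessary and
sufficient condition for identifiability.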

Evaluating the Kruskal rank of $\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}$ is computationally
difficult as it requires checking over all possible combinations
of rows of a matrix. The computational complexity worsens as
$q$ and $\kappa$ become large. The conditions above may be weakened
using the concept of coherence [5], found in the compressed
sensing and dictionary learning literature, to derive polynomial
time algorithms for verifying HMM identifiability. Further
details are found in [16].

Remark 7: The results may be extended to non-stationary
Markov chains. In this case, let $\mathbf{A}(t)$ and $\mathbf{B}(t)$ be the time-heterogeneous
transition and observation matrices respectively.
Then, Theorem 5 may be modified, essentially asserting that
for each $t = 1, 2, \cdots, N$, the condition $\mathrm{krank}(\mathbf{B}(t) \otimes^{\mathrm{row}} \mathbf{A}(t)) = q$
must hold for the non-stationary HMM to be
permutation and scaling indeterminate.

V. Multi-observer HMM Setting 

The above results prove useful in the study of multi-observer
HMMs, which have applications in machine learning [11], and
detecting attacks on Internet Service Providers [12]. We shall
study identifiability in this setting, but first, we need to lay the
foundations for the multi-observer case.

In the multi-observer setting, there are $m \ge 2$ observers
of an underlying Markov chain. The observers are assumed
independent of each other, since dependence would weaken
the information content of their observations. The underlying
irreducible, aperiodic Markov chain $\{X_t\}_{t \ge 1}$ is being
observed by all $m$ observers, with each $X_t \in \mathcal{X}$, starting
from initial state probabilities $\boldsymbol{\pi}$, drawn from a stationary
distribution. Each observer $j$ may have a different perspective
of the chain, denoted by the processes $\{Y_t^{(j)}\}_{t \ge 1}$, with each
$Y_t^{(j)} \in \mathcal{Y}^{(j)}$, $\forall j = 1, 2, \cdots, m$. Without loss of generality, let
$\mathcal{X} = \{1, 2, \cdots, q\}$ and for each $j$, $\mathcal{Y}^{(j)} = \{1, 2, \cdots, \kappa_j\}$.

While the transition matrix of the hidden states $\mathbf{A}$ remains
the same for all observers, their associated observation matrices
may differ. If at least two observation matrices are distinct,
i.e. there exist indices $\ell$ and $\ell'$ such that $\mathbf{B}^{(\ell)} \ne \mathbf{B}^{(\ell')}$, the
independent observers are called heterogeneous observers,
otherwise they are called homogeneous observers. We model
the separate observation matrices $\mathbf{B}^{(j)}$ for each observer
$j = 1, 2, \cdots, m$, each of size $q \times \kappa_j$ in the heterogeneous
case, and $\mathbf{B}^{(j)} := \mathbf{B}$ for all $j$ in the homogeneous case. In
either setting, the HMM is described by the parameter set
$\lambda_{\mathrm{multi}} := \{\boldsymbol{\pi}; m, q, \{\kappa_j\}_{j=1}^{m}, \mathbf{A}, \{\mathbf{B}^{(j)}\}_{j=1}^{m}\}$, with appropriate
modifications to the observation matrix depending on the
setting. For the homogeneous setting, we let $\kappa_j := \kappa$ and
$\mathbf{B}^{(j)} := \mathbf{B}$, $\forall j$.

As we shall see, whether the observers are heterogeneous or
homogeneous makes a significant difference to the identifiability
of the HMM.

A. Multi-letter tensor 

Unlike the single observer scenario, there are multiple
observations in a single time step. We first consider the
heterogeneous case, where, without loss of generality, we
define the multi-letter tensor $\mathcal{M}_*$, an order $m + 2$ tensor of
dimensions $q \times q \times \kappa_1 \times \cdots \times \kappa_m$, with component matrices
$[\mathbf{A}, \mathbf{I}_q, \{\mathbf{B}^{(j)}\}_{j=1}^{m}]$.

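Just as in the case of a single observer, the multi-letter
tensor is connected to the HMM via its matricisation. Let
$\kappa' := \prod_{j=1}^{m} \kappa_j$. Thus, the joint probability of a particular
observed sequence $\mathbf{y}_1, \mathbf{y}_2, \cdots, \mathbf{y}_N$, noting each observation
is now a vector, may be described by

$$P_{\lambda_{\mathrm{multi}}}(\mathbf{Y}_1 = \mathbf{y}_1, \mathbf{Y}_2 = \mathbf{y}_2, \cdots, \mathbf{Y}_N = \mathbf{y}_N) = \boldsymbol{\pi}\mathbf{W}_*\mathbf{E}(\mathbf{y}_1)\mathbf{W}_*\mathbf{E}(\mathbf{y}_2) \cdots \mathbf{W}_*\mathbf{E}(\mathbf{y}_N)\mathbf{1}_q \tag{4}$$

where $\mathbf{E}(k)$ is a $\kappa' q \times q$ matrix, divided into $\kappa'$ row partitions,
with the $q \times q$ identity matrix in the $k$-th row partition, $\mathbf{1}_q$ is
the $q \times 1$ vector of ones, and

$$\mathbf{W}_* := \bigotimes_{j=1}^{m}{}^{\mathrm{row}}\, \mathbf{B}^{(j)} \otimes^{\mathrm{row}} \mathbf{A} = \mathbf{B}^{(1)} \otimes^{\mathrm{row}} \mathbf{B}^{(2)} \cdots \otimes^{\mathrm{row}} \mathbf{B}^{(m)} \otimes^{\mathrm{row}} \mathbf{A}. \tag{5}$$

In this formulation, all possible output sequences from space
$\mathcal{Y}^{(1)} \times \mathcal{Y}^{(2)} \times \cdots \times \mathcal{Y}^{(m)}$ are ordered lexicographically, with
$\mathbf{E}(\mathbf{y})$ selecting the correct position of $\mathbf{y}$ out of this ordering.
Note that $\mathbf{W}_*$ is equivalent to $\mathcal{M}_*$ since it is the second mode
matricisation of the tensor $\mathcal{M}_*$. Similarly, in the homogeneous
setting, let $\mathbf{W}_\circ$ have the same structure as (5), with $\mathbf{B}^{(j)} := \mathbf{B}$,
$\forall j$, and $\kappa' = \kappa^m$. We have the following result for $\mathcal{M}_*$.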
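The lexicographic block structure of the multi-observer matrix can be illustrated for two observers. In the sketch below, the names `multi_letter_W` and `lex_index`, the 0-based letters and the example matrices are all assumptions made for this illustration.

```python
import numpy as np
from functools import reduce

def row_tensor(A, B):
    """Row-wise tensor product: row i is kron(A[i], B[i])."""
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

def multi_letter_W(Bs, A):
    """Chain the row-wise product over B^(1), ..., B^(m) and finally A."""
    return reduce(row_tensor, list(Bs) + [A])

def lex_index(y, kappas):
    """Position of the 0-based observation vector y in lexicographic order."""
    return int(np.ravel_multi_index(y, kappas))

# Illustrative two-observer model (assumed values).
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B1 = np.array([[0.7, 0.3], [0.4, 0.6]])             # observer 1, kappa_1 = 2
B2 = np.array([[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]])   # observer 2, kappa_2 = 3
q = A.shape[0]

W = multi_letter_W([B1, B2], A)                     # q x (kappa_1 * kappa_2 * q)
y = (1, 2)                                          # joint observation of both observers
k = lex_index(y, (2, 3))
block = W[:, k * q:(k + 1) * q]                     # the block that E(y) selects
```

The selected block equals the transition matrix with row i scaled by the product of the two observers' emission probabilities for state i, which is what independence of the observers suggests.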

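Lemma 8: Assume $m \ge 2$. The heterogeneous multi-letter
tensor $\mathcal{M}_*$ is unique up to permutation and scaling of rows
if and only if $\mathrm{krank}(\bigotimes_{j=1}^{m}{}^{\mathrm{row}}\, \mathbf{B}^{(j)} \otimes^{\mathrm{row}} \mathbf{A}) = q$.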



Proof: The proof is similar to the proof of Lemma 6,
extended to the multidimensional case; thus, we only
sketch the proof here. We claim that it is necessary and
sufficient that none of the non-trivial linear combinations of
rows of $\bigotimes_{j=1}^{m}{}^{\mathrm{row}}\, \mathbf{B}^{(j)} \otimes^{\mathrm{row}} \mathbf{A}$ is expressible by a tensor product
of two row vectors, i.e. $\mathrm{krank}(\bigotimes_{j=1}^{m}{}^{\mathrm{row}}\, \mathbf{B}^{(j)} \otimes^{\mathrm{row}} \mathbf{A}) = q$.
The chief ingredients are: (1) show a counterexample to prove
necessity, where the same example from the proof of Lemma
6 can be used, appropriately modified to account for additional
dimensions, to construct an alternative decomposition of $\mathcal{M}_*$,
and (2) for sufficiency, show that $\mathbf{I}_q$, the full rank component
of $\mathcal{M}_*$, satisfies Kruskal's permutation lemma. Then, the claim
is established. ■

We must, however, be careful when the observers are 
independent and homogeneous. In this scenario, the above 
result no longer holds. Instead, the homogeneous setting is 
equivalent to the single observer setting, evidenced by the 
following result. 

Lemma 9: It is necessary and sufficient that $\mathrm{krank}(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = q$
for the homogeneous multi-letter tensor $\mathcal{M}_\circ$ to be
unique up to permutation and scaling of rows.

An intuitive explanation is that each additional component,
$\mathbf{B}$, is exactly the same and does not provide sufficient variability
for unique decomposition. Thus, even if the tensor
$\mathcal{M}_\circ$ is matricised in different ways, one can always find an
alternative decomposition of $\mathcal{M}_\circ$, similar to an example by
Stegeman et al. [15] for the 3-way tensor. It is for this reason that,
decomposition-wise, $\mathcal{M}_\circ$ is no different from the single letter
tensor $\mathcal{L}$.

B. Identifiability conditions 

Theorem 10: Suppose the $m$ observers are independent and
homogeneous. For an HMM parameterised by $\lambda_{\mathrm{multi}}$ to be
unique up to a scaling and permutation of states, it is necessary
and sufficient that the per letter tensor of the HMM satisfies
Lemma 6, i.e. $\mathrm{krank}(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = q$.

Proof: The proof follows from the properties of $\mathcal{M}_\circ$.
Then, it is clear $\mathrm{krank}(\mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = q$ if and only if
$\mathrm{krank}(\bigotimes_{j=1}^{m}{}^{\mathrm{row}}\, \mathbf{B} \otimes^{\mathrm{row}} \mathbf{A}) = q$, from Lemma 9, otherwise
an equivalent HMM can be constructed, such that the original
HMM is no longer permutation and scaling indeterminate. ■

The result shows that the homogeneous setting is essentially
equivalent to the single observer setting. As mentioned in [12],
if an HMM is unidentifiable in the single observer case, it is
also unidentifiable in the multiple independent homogeneous
observer case, as there is not enough variability in $\mathcal{M}_\circ$. Thus,
no matter how many independent homogeneous observers are
present, if the model cannot be identified in the single observer
setting, then the model remains unidentifiable.

We next turn our attention to the heterogeneous case, a
consequence of Lemma 8.

Theorem 11: Suppose the $m$ observers are independent and
heterogeneous. For an HMM parameterised by $\lambda_{\mathrm{multi}}$ to be
unique up to a scaling and permutation of states, it is necessary
and sufficient that the multi-letter tensor of the HMM satisfies
Lemma 8, i.e. $\mathrm{krank}(\bigotimes_{j=1}^{m}{}^{\mathrm{row}}\, \mathbf{B}^{(j)} \otimes^{\mathrm{row}} \mathbf{A}) = q$.



The extra dimensions of $\mathcal{M}_*$ are related to the additional
advantage of having multiple independent observers. Each
single observer $j = 1, 2, \cdots, m$ is essentially restricted to
a per letter tensor $\mathcal{L}^{(j)}$, which may not satisfy Lemma 6
individually, but the observers satisfy Lemma 8 when their observations
are jointly considered. This proves a natural advantage multiple
independent heterogeneous observers have over a single
observer, used to great effect in [12].
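The advantage of heterogeneous observers can be seen in a small numerical example in which each observer alone fails the Kruskal rank condition while the joint condition holds. The model below is hypothetical and chosen only for this sketch (three hidden states, two observers with two letters each); the helper names are likewise assumptions.

```python
import numpy as np
from itertools import combinations

def row_tensor(A, B):
    """Row-wise tensor product: row i is kron(A[i], B[i])."""
    return np.einsum("ij,ik->ijk", A, B).reshape(A.shape[0], -1)

def krank(M, tol=1e-10):
    """Largest k such that every set of k rows of M is linearly independent."""
    n, k = M.shape[0], 0
    for size in range(1, n + 1):
        if all(np.linalg.matrix_rank(M[list(r)], tol=tol) == size
               for r in combinations(range(n), size)):
            k = size
        else:
            break
    return k

# Three hidden states whose transition rows coincide (assumed example).
A = np.tile([0.2, 0.3, 0.5], (3, 1))
# Each observer sees only two letters, so a single observer has rank at most 2.
B1 = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
B2 = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])

single_1 = krank(row_tensor(B1, A))                  # observer 1 alone
single_2 = krank(row_tensor(B2, A))                  # observer 2 alone
joint = krank(row_tensor(row_tensor(B1, B2), A))     # both observers jointly
```

Here each single-observer Kruskal rank is 2, which is below q = 3, while the joint product attains 3.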

VI. Conclusion 

In this paper, we revisit the identifiability of hidden Markov 
models, via tensor decomposition, where there are well-known 
results regarding the permutation and scaling indeterminacy 
of tensors. We proved deterministic identifiability conditions 
of single and multi-observer HMMs using well-established 
results on the Kruskal rank. Our results are stronger than 
previous results, where only generic identifiability based on 
the regularity of the HMM is assumed. Future work includes
using our framework to provide insights into the inference of
the transition and observation matrices of HMMs and extending
the work to dependent observers.